Infection by human immunodeficiency virus (HIV) may lead to acquired immunodeficiency syndrome (AIDS) and continues to be a global public health problem, with an estimated 37 million individuals worldwide who are HIV-positive. New York City has the oldest and largest HIV epidemic in the United States, and it also leads the nation in the number of new HIV cases nowadays.
In this project, we aim to study the relationships between different characteristics of patients and the HIV/AIDS Diagnosis Outcome in New York City from 2011 to 2015.
a) Is there significant association between HIV/AIDS diagnoses in New York City and demographic characteristics such as gender, age, neighborhoods, and median income?
—Using visuals and regression model to study
b) Is there correlation between the HIV diagnoses counts and median income in every specific neighborhood in NYC?
—Using map plot and regression model to study
c) What current social phenomenon can be reflected from the distribution of the HIV diagnosis and what is the public health implication of NYC? Is the conclusions generalizable to other place in US?
—Looking into relevant papers and combined what we have learned in the Epi1
Main: HIV/AIDS Diagnoses by Neighborhood, Age Group, and Race/Ethnicity from NYC open data. https://data.cityofnewyork.us/Health/DOHMH-HIV-AIDS-Annual-Report/fju2-rdad
Shapefile of NYC Zip Codes - tabulation areas provided by NYC Department of Information Technology & Telecommunications (DOITT) http://data.beta.nyc/dataset/3bf5fb73-edb5-4b05-bb29-7c95f4a727fc/resource/6df127b1-6d04-4bb7-b983-07402a2c3f90
Zip code of United Hospital Fund neighborhood https://www1.nyc.gov/assets/doh/downloads/pdf/data/appb.pdf
Personal Income data Raw data comes from 2011-2015 American Community Survey (ACS) Public Use Microdata Sample (PUMS). https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_pums_csv_2011_2015&prodType=document The raw dataset is super large with hundreds of variables, so we select it based on our target variables - location and total person income. The selected dataset is saved as “selected_pums.csv” in the data folder under our R project. The location data from 2011 ACS is based on Public use microdata area code (PUMA) 2000, while the definition for PUMA 2000 is nowhere to be found. This is why we exclude the income data from 2011. For the 2012-2015 PUMS data, we transfer the PUMA 2010 into zipcode for a better visualization on the NYC map. The transform is based on the following two datasets. ZCTA10 to PUMA10 http://faculty.baruch.cuny.edu/geoportal/resources/nyc_geog/nyc_zcta10_to_puma10.xls ZIP code to ZCTA10 http://faculty.baruch.cuny.edu/geoportal/resources/nyc_geog/zip_to_zcta10_nyc_revised.xls
id = "18ouzlI-Hs3t_DJW4rGUQwvHS0BurTmCV"
UHF_zipcode =
read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id)) %>%
slice(-43) %>%
select(-Borough) %>%
rename("UHF" = "UHF.Neighborhood") %>%
janitor::clean_names()
raw_hiv =
GET("https://data.cityofnewyork.us/api/views/fju2-rdad/rows.csv") %>%
content("parsed") %>%
janitor::clean_names()
combine_hiv =
right_join(UHF_zipcode, raw_hiv, by = "uhf") %>%
janitor::clean_names() %>%
separate(zip_code, into = c("zipcode1", "zipcode2", "zipcode3",
"zipcode4", "zipcode5", "zipcode6", "zipcode7", "zipcode8",
"zipcode9"), sep = ", ") %>%
gather(key = zip_code, value = zipcode_value, zipcode1:zipcode9) %>%
filter(!is.na(zipcode_value)) %>%
rename("zipcode" = "zipcode_value") %>%
select(zipcode, everything(), -zip_code)
r = GET('http://data.beta.nyc//dataset/3bf5fb73-edb5-4b05-bb29-7c95f4a727fc/resource/6df127b1-6d04-4bb7-b983-07402a2c3f90/download/f4129d9aa6dd4281bc98d0f701629b76nyczipcodetabulationareas.geojson')
nyc_zipcode = readOGR(content(r,'text'), 'OGRGeoJSON', verbose = F)
zipcode_lat_lng = nyc_zipcode@data %>%
select(zipcode = postalCode, longitude, latitude) %>%
mutate(zipcode = as.character(zipcode))
combine_hiv1 =
full_join(zipcode_lat_lng, combine_hiv, by = "zipcode") %>%
mutate(longitude = as.numeric(longitude), latitude = as.numeric(latitude)) %>%
group_by(uhf) %>%
summarise(lng = mean(longitude),
lat = mean(latitude)) %>%
filter(!(uhf == "Pelhem - Throgs Neck"))
id1 = "1qGkhF7qXf_Qa8YTePLgqvEYfVLWX1tFs"
pums_raw = read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id1))
temp = tempfile(fileext = ".xls")
dataURL = "http://faculty.baruch.cuny.edu/geoportal/resources/nyc_geog/nyc_zcta10_to_puma10.xls"
download.file(dataURL, destfile = temp, mode = 'wb')
zcta_to_puma = readxl::read_xls(temp, sheet = 2) %>%
select(zcta = zcta10, puma = puma10) %>%
mutate(puma = as.numeric(puma))
dataURL = "http://faculty.baruch.cuny.edu/geoportal/resources/nyc_geog/zip_to_zcta10_nyc_revised.xls"
download.file(dataURL, destfile = temp, mode = 'wb')
zip_to_zcta = readxl::read_xls(temp, sheet = 2) %>%
select(zipcode, zcta = zcta5)
pums_data =
pums_raw %>%
select(puma = PUMA10, income = PINCP, year = ADJINC) %>%
filter(puma != -9) # remove data from 2011 due to lack of area information
pums_data$year = recode(pums_data$year,
"1042852" = "2012",
"1025215" = "2013",
"1009585" = "2014",
"1001264" = "2015")
pums_data =
pums_data %>%
group_by(year, puma) %>%
summarise(mid_income = median(income, na.rm = TRUE)) %>%
ungroup() # calculate yearly median income for each area
puma_to_zipcode = right_join(zip_to_zcta, zcta_to_puma, by = "zcta") %>% # generaate a puma to zipcode file
select(puma, zipcode)
income_zipcode = right_join(pums_data, puma_to_zipcode, by = "puma") %>% # matching zipcode with median income data
select(year, zipcode, mid_income) %>%
mutate(year = as.numeric(year))
combine_hiv_income =
left_join(combine_hiv, income_zipcode, by = c("year", "zipcode"))
hiv_data =
raw_hiv %>%
mutate(year = as.character(year), age = as.factor(age))
neb_plot = hiv_data %>%
group_by(uhf, gender) %>%
filter(year != "ALL", borough != "All", uhf != "All", gender != "All") %>%
filter(age != "All") %>%
summarise(sum_hiv = sum(hiv_diagnoses)) %>%
ggplot(aes(x = reorder(uhf, sum_hiv), y = sum_hiv, color = gender)) +
coord_flip() +
geom_point() +
labs(
title = "Figure 1.Gender and Neighborhood Influence on HIV Incidence",
x = "Neighborhood",
y = "HIV diagnoses"
)
ggplotly(neb_plot)
The number of HIV diagnoses is apperently higher among male subgroups than female in all neighborhoods. Beford Stuyvesant - Crown Heights have the highest total HIV diagnoses and highest female HIV diagnoses cases. Chelsea - Clinton ranks first in male HIV diagnoses. Bayside - Little Neck has lowest number of HIV diagnoses for both male and female.
age_plot = hiv_data %>%
filter(race == "All" & borough == "All" & age != "All") %>%
group_by(gender, age) %>%
summarise(sum_hiv = sum(hiv_diagnoses)) %>%
ggplot(aes(y = sum_hiv, x = age, fill = gender)) +
geom_bar(stat = "identity", alpha = 0.8, position = position_dodge()) +
scale_fill_brewer(palette = "Dark2") +
labs(
title = "Figure 2.Gender and Age Influence on HIV Incidence",
x = "Age range",
y = "HIV diagnoses"
)
ggplotly(age_plot)
As is shown in Figure.2, in every age range, HIV incidence rate in male is significantly higher than that in female population. The potential explaination could be the gender differences with respect to HIV/AIDS depend on patterns of disease transmission. Most infections occurred in adults aged 20 to 29 years, and the incidence porpotion declines as the increase of age.
race_plot = hiv_data %>%
filter(age == "All" & borough == "All" & race != "All") %>%
group_by(gender, race) %>%
summarise(sum_hiv = sum(hiv_diagnoses)) %>%
ggplot(aes(y = sum_hiv, x = reorder(race, sum_hiv), fill = gender)) +
geom_bar(stat = "identity", alpha = 0.8, position = position_dodge()) +
scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
labs(
title = "Figure 3.1 Race and Gender Influence on HIV Incidence",
x = "Race",
y = "HIV diagnoses"
)
ggplotly(race_plot)
By race/ethnicity, black men/women have the highest rates of new HIV infections among all men/women. Whereas the incidence rate among Asian/Pacific Islander is the lowest given the study population in NYC. This is because some race population groups have higher rates of HIV in their communities, thus raising the risk of new infections with each sexual or drug use encounter. Plus, social, economic, and demographic factors of various race group—such as stigma, discrimination, income, education, and geographic region—could also affect their risk for HIV.
race_plot = hiv_data %>%
filter(age == "All" & borough == "All" & race != "All" & hiv_related_death_rate!= 99999) %>%
group_by(gender, race) %>%
summarise(sum_hiv = sum(hiv_related_death_rate)) %>%
ggplot(aes(y = sum_hiv, x = reorder(race, sum_hiv), fill = gender)) +
geom_bar(stat = "identity", alpha = 0.8, position = position_dodge()) +
scale_fill_manual(values = c("goldenrod3", "steelblue3")) +
labs(
title = "Figure 3.2 Race and Gender Influence on HIV-related-death rate",
x = "Race",
y = "HIV diagnoses",
caption = "Data from the ..."
)
ggplotly(race_plot)
However, if we compare the plot of HIV Incidence with the plot of HIV related death rate, we will notice that although the HIV incidence in male is much higher than female, the HIV related death rate in women is higher than that of men in some race, like White and Latino/Hispanic. We can draw the conclusion that women infected with HIV disease are more vulnerable than men with HIV disease. Other possible explaination could be male are more likely to get access to the HIV treatment than women, reflecting the gender inequality of social status.
overall_year = hiv_data %>%
filter(borough == "All", uhf == "All", age == "All", race == "All") %>%
group_by(year, gender) %>%
summarize(sum_hiv = sum(hiv_diagnoses)) %>%
ggplot(aes(x = year, y = sum_hiv, group = gender, color = gender)) +
geom_line() +
labs(
title = "Figure 4.Yearly change trend on HIV Incidence",
x = "Year",
y = "Sum of HIV diagnoses",
caption = "Data from the ..."
)
ggplotly(overall_year)
Figure 4 shows the overall declining trend of total HIV incidence among study population in NYC from 2011 to 2015. The decrease in male is much significant than in female, and not very obvious in transgender group for their low base account. The downward trend in HIV incidence rate reflects the improvment of public health practice affect the HIV incidence rate through repeated exposure to counseling (such as the promotion of condom use or safe sex or other prevention messages) and the advances in HIV treatments.
income_plot = combine_hiv_income %>%
filter(year != "2011") %>%
group_by(uhf, year) %>%
summarise(sum_hiv = sum(hiv_diagnoses), mid_in = median(mid_income)) %>%
ggplot(aes(x = mid_in, y = sum_hiv)) +
facet_grid( ~ year) +
geom_point() +
geom_smooth(method = lm) +
theme_bw() +
theme(legend.position = "None") +
labs(
title = "Figure 5.Income Influence on HIV Incidence",
x = "Median Income of each neighborhood",
y = "HIV diagnoses"
)
ggplotly(income_plot)
dis = combine_hiv_income %>%
filter(year != "2011" & age == "All" & gender == "All" & race == "All") %>%
distinct(uhf, year, .keep_all = TRUE)
cor(dis$hiv_diagnoses, dis$mid_income)
## [1] -0.2791455
Interpretation: the correlation between hiv_diagnoses and mid_income is -0.279, which indicates overall HIV incidence decreases as the overall median income increases.
In Figure 5, it is obvious that the points are mostly concentrated in low income neigbourhood and the number of HIV diagnoses in high average income area( > 60000/year) centered in less than 1000 cases/neigbourhood. This result is exactly what we expected, because low-income community are tend to have insufficient healthcare supply, less insurance coverage, poor education level and inadequate epidemiology awareness, which could jointly cause the relatively high HIV incidence. So, we intend to advocate the related authorities to spare more public health resource in low-to-median-income neigbourhood to raise public awareness of HIV prevention knowledge and increase budgets of medical facilities in those areas.
Limitations:
We can not visualize the effect of age and race simultaneously.
hiv_data_reg = raw_hiv %>%
rename(neighborhood = uhf) %>%
filter(year != "ALL", borough != "All", neighborhood != "All", gender != "All") %>%
mutate(year = as.character(year), age = as.factor(age), gender = as.factor(gender))
income_hiv = combine_hiv_income
income_hiv %>%
filter(year != "2011" & age != "All") %>%
lm(hiv_diagnoses ~ borough + gender + age + mid_income, data = .) %>%
summary() %>%
broom::tidy() %>%
knitr::kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.983 | 0.302 | 3.252 | 0.001 |
| boroughBrooklyn | 0.297 | 0.281 | 1.060 | 0.289 |
| boroughManhattan | 3.091 | 0.331 | 9.332 | 0.000 |
| boroughQueens | -1.245 | 0.259 | -4.811 | 0.000 |
| boroughStaten Island | -4.376 | 0.397 | -11.016 | 0.000 |
| genderMale | 6.083 | 0.152 | 40.138 | 0.000 |
| age20 - 29 | 9.600 | 0.262 | 36.576 | 0.000 |
| age30 - 39 | 6.870 | 0.262 | 26.175 | 0.000 |
| age40 - 49 | 4.627 | 0.262 | 17.627 | 0.000 |
| age50 - 59 | 2.355 | 0.262 | 8.972 | 0.000 |
| age60+ | 0.427 | 0.262 | 1.626 | 0.104 |
| mid_income | 0.000 | 0.000 | -17.851 | 0.000 |
income_hiv %>%
filter(year != "2011" & race != "All") %>%
mutate(race = fct_relevel(race, "White")) %>%
lm(hiv_diagnoses ~ borough + gender + race + mid_income, data = .) %>%
summary() %>%
broom::tidy() %>%
knitr::kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.143 | 0.514 | 10.001 | 0.000 |
| boroughBrooklyn | 0.357 | 0.493 | 0.724 | 0.469 |
| boroughManhattan | 3.710 | 0.582 | 6.376 | 0.000 |
| boroughQueens | -1.494 | 0.454 | -3.287 | 0.001 |
| boroughStaten Island | -5.251 | 0.698 | -7.527 | 0.000 |
| genderMale | 7.299 | 0.266 | 27.425 | 0.000 |
| raceAsian/Pacific Islander | -3.628 | 0.421 | -8.621 | 0.000 |
| raceBlack | 7.304 | 0.421 | 17.357 | 0.000 |
| raceLatino/Hispanic | 5.399 | 0.421 | 12.829 | 0.000 |
| raceOther/Unknown | -5.008 | 0.421 | -11.900 | 0.000 |
| mid_income | 0.000 | 0.000 | -12.197 | 0.000 |
Interpretation: HIV incidence among male are significantly higher than that among female, which can be explained by gender differences with respect to HIV/AIDS depend on patterns of disease transmission. HIV incidence in adults aged 20 to 29, 30 to 39, 40 to 49 and 50 to 59 are all significantly higher than young people aged 13 to 19. HIV incidence in Black and Latino/Hispanic population are significantly higher than that in white population. The effect of income is not significant.
diagnoses_by_zipcode =
combine_hiv_income %>%
filter(gender == "All", age == "All", race == "All") %>%
mutate(zipcode = factor(zipcode)) %>%
group_by(zipcode, uhf) %>%
summarise(sum_diagnoses = sum(hiv_diagnoses), sum_diagnosis_rate = sum(hiv_diagnosis_rate))
map_data = geo_join(nyc_zipcode, diagnoses_by_zipcode, "postalCode", "zipcode")
points_spdf = combine_hiv1 %>%
filter(!is.na(lng))
coordinates(points_spdf) = ~lng + lat
proj4string(points_spdf) = proj4string(nyc_zipcode)
matches = over(points_spdf, nyc_zipcode)
points = full_join(matches, rename(diagnoses_by_zipcode, postalCode = zipcode), by = "postalCode")
pal1 = colorNumeric(palette = "Reds", domain = range(map_data@data$sum_diagnoses, na.rm = T))
# "BuPu" "viridis" "Greens""inferno"
leaflet(map_data) %>%
addTiles() %>%
addPolygons(color = "black", weight = 1,
fillColor = ~pal1(sum_diagnoses), fillOpacity = 0.8,
popup = ~stringr::str_c(uhf, " sum:", factor(sum_diagnoses))) %>%
addMarkers(~longitude, ~latitude,
popup = ~stringr::str_c(uhf, " sum:", factor(sum_diagnoses)), data = points) %>%
addProviderTiles("CartoDB.Positron") %>%
addLegend(position = "bottomright", pal = pal1, values = ~sum_diagnoses,
title = "The amount of HIV diagnoses", opacity = 0.8) %>%
setView(-73.98, 40.75, zoom = 11)
pal2 = colorNumeric(palette = "Reds", domain = range(map_data@data$sum_diagnosis_rate, na.rm = T))
# "BuPu" "viridis" "Greens""inferno"
leaflet(map_data) %>%
addTiles() %>%
addPolygons(color = "black", weight = 1,
fillColor = ~pal1(sum_diagnosis_rate), fillOpacity = 0.8,
popup = ~stringr::str_c(uhf, " sum:", factor(sum_diagnosis_rate))) %>%
addMarkers(~longitude, ~latitude,
popup = ~stringr::str_c(uhf, " sum:", factor(sum_diagnosis_rate)), data = points) %>%
addProviderTiles("CartoDB.Positron") %>%
addLegend(position = "bottomright", pal = pal2, values = ~sum_diagnosis_rate,
title = "The amount of HIV diagnosis rate", opacity = 0.8) %>%
setView(-73.98, 40.75, zoom = 11)
income_by_zipcode =
combine_hiv_income %>%
filter(year != "2011") %>%
filter(gender == "All", age == "All", race == "All") %>%
group_by(zipcode, uhf) %>%
summarise(mean_income = mean(mid_income))
map_data_income = geo_join(nyc_zipcode, income_by_zipcode, "postalCode", "zipcode")
pal3 = colorNumeric(palette = "Greens", domain = range(map_data_income@data$mean_income, na.rm = T))
leaflet(map_data_income) %>%
addTiles() %>%
addPolygons(color = "black", weight = 1,
fillColor = ~pal3(mean_income), fillOpacity = 0.8,
popup = ~stringr::str_c(uhf, " sum:", factor(mean_income))) %>%
addProviderTiles("CartoDB.Positron") %>%
addLegend(position = "bottomright", pal = pal3, values = ~mean_income,
title = "Mean income of each uhf", opacity = 0.8) %>%
setView(-73.98, 40.75, zoom = 11)
Findings are included in the comments of each plot.
Future Direction:
We can find the HIV diagnosis situation in different places, including states, cities and towns, or even different countries, representing different level of GDP, development and lifestyle. So that the conclusion will be more comprehensive.
Moreover, we could also include HIV diagnosis distribution during longer time period, and conduct Time Series Model analysis to better illustrate the development of HIV new cases rate.